Goto

Collaborating Authors

 computing machinery




Evaluating the Ability of Explanations to Disambiguate Models in a Rashomon Set

Rawal, Kaivalya, Delaney, Eoin, Fu, Zihao, Wachter, Sandra, Russell, Chris

arXiv.org Machine Learning

Explainable artificial intelligence (XAI) is concerned with producing explanations indicating the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However explanations themselves can vary depending on the explainer used, and need to be evaluated. In the paper "Evaluating Model Explanations without Ground Truth", we proposed three principles of explanation evaluation and a new method "AXE" to evaluate the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that rely on comparing model explanations against ideal ground truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would highlight these differences instead, helping select models from the Rashomon set. The selection of alternate models from the Rashomon set can maintain identical predictions but mislead explainers into generating false explanations, and mislead evaluation methods into considering the false explanations to be of high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies such as those based on model sensitivity or ground truth comparison, AXE can determine when protected attributes are used to make predictions.


ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation

Yang, Boyin, Jiang, Puming, Kristensson, Per Ola

arXiv.org Artificial Intelligence

People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.


When AI Gives Advice: Evaluating AI and Human Responses to Online Advice-Seeking for Well-Being

Kumar, Harsh, Chahal, Jasmine, Zhao, Yinuo, Zhang, Zeling, Wei, Annika, Tay, Louis, Anderson, Ashton

arXiv.org Artificial Intelligence

Seeking advice is a core human behavior that the Internet has reinvented twice: first through forums and Q\&A communities that crowdsource public guidance, and now through large language models (LLMs) that deliver private, on-demand counsel at scale. Yet the quality of this synthesized LLM advice remains unclear. How does it compare, not only against arbitrary human comments, but against the wisdom of the online crowd? We conducted two studies (N = 210) in which experts compared top-voted Reddit advice with LLM-generated advice. LLMs ranked significantly higher overall and on effectiveness, warmth, and willingness to seek advice again. GPT-4o beat GPT-5 on all metrics except sycophancy, suggesting that benchmark gains need not improve advice-giving. In our second study, we examined how human and algorithmic advice could be combined, and found that human advice can be unobtrusively polished to compete with AI-generated comments. Finally, to surface user expectations, we ran an exploratory survey with undergraduates (N=148) that revealed heterogeneous, persona-dependent preferences for agent qualities (e.g., coach-like: goal-focused structure; friend-like: warmth and humor). We conclude with design implications for advice-giving agents and ecosystems blending AI, crowd input, and expert oversight.


Mental Models of Autonomy and Sentience Shape Reactions to AI

Pauketat, Janet V. T., Shank, Daniel B., Manoli, Aikaterina, Anthis, Jacy Reese

arXiv.org Artificial Intelligence

Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.


PinRec: Outcome-Conditioned, Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems

Agarwal, Prabhat, Badrinath, Anirudhan, Bhasin, Laksh, Yang, Jaewon, Botta, Edoardo, Xu, Jiajing, Rosenberg, Charles

arXiv.org Artificial Intelligence

Generative retrieval methods utilize generative sequential modeling techniques, such as transformers, to generate candidate items for recommender systems. These methods have demonstrated promising results in academic benchmarks, surpassing traditional retrieval models like two-tower architectures. However, current generative retrieval methods lack the scalability required for industrial recommender systems, and they are insufficiently flexible to satisfy the multiple metric requirements of modern systems. This paper introduces PinRec, a novel generative retrieval model developed for applications at Pinterest. PinRec utilizes outcome-conditioned generation, enabling modelers to specify how to balance various outcome metrics, such as the number of saves and clicks, to effectively align with business goals and user exploration. Additionally, PinRec incorporates multi-token generation to enhance output diversity while optimizing generation. Our experiments demonstrate that PinRec can successfully balance performance, diversity, and efficiency, delivering a significant positive impact to users using generative models. This paper marks a significant milestone in generative retrieval, as it presents, to our knowledge, the first rigorous study on implementing generative retrieval at the scale of Pinterest.


GuideNav: User-Informed Development of a Vision-Only Robotic Navigation Assistant For Blind Travelers

Hwang, Hochul, Yang, Soowan, Monon, Jahir Sadik, Giudice, Nicholas A, Lee, Sunghoon Ivan, Biswas, Joydeep, Kim, Donghyun

arXiv.org Artificial Intelligence

While commendable progress has been made in user-centric research on mobile assistive systems for blind and low-vision (BLV) individuals, references that directly inform robot navigation design remain rare. To bridge this gap, we conducted a comprehensive human study involving interviews with 26 guide dog handlers, four white cane users, nine guide dog trainers, and one O\&M trainer, along with 15+ hours of observing guide dog-assisted walking. After de-identification, we open-sourced the dataset to promote human-centered development and informed decision-making for assistive systems for BLV people. Building on insights from this formative study, we developed GuideNav, a vision-only, teach-and-repeat navigation system. Inspired by how guide dogs are trained and assist their handlers, GuideNav autonomously repeats a path demonstrated by a sighted person using a robot. Specifically, the system constructs a topological representation of the taught route, integrates visual place recognition with temporal filtering, and employs a relative pose estimator to compute navigation actions - all without relying on costly, heavy, power-hungry sensors such as LiDAR. In field tests, GuideNav consistently achieved kilometer-scale route following across five outdoor environments, maintaining reliability despite noticeable scene variations between teach and repeat runs. A user study with 3 guide dog handlers and 1 guide dog trainer further confirmed the system's feasibility, marking (to our knowledge) the first demonstration of a quadruped mobile system retrieving a path in a manner comparable to guide dogs.


Modeling User Preferences as Distributions for Optimal Transport-based Cross-domain Recommendation under Non-overlapping Settings

Xiao, Ziyin, Suzumura, Toyotaro

arXiv.org Artificial Intelligence

Cross-domain recommender (CDR) systems aim to transfer knowledge from data-rich domains to data-sparse ones, alleviating sparsity and cold-start issues present in conventional single-domain recommenders. However, many CDR approaches rely on overlapping users or items to establish explicit cross-domain connections, which is unrealistic in practice. Moreover, most methods represent user preferences as fixed discrete vectors, limiting their ability to capture the fine-grained and multi-aspect nature of user interests. To address these limitations, we propose DUP-OT (Distributional User Preferences with Optimal Transport), a novel framework for non-overlapping CDR. DUP-OT consists of three stages: (1) a shared preprocessing module that extracts review-based embeddings using a unified sentence encoder and autoencoder; (2) a user preference modeling module that represents each user's interests as a Gaussian Mixture Model (GMM) over item embeddings; and (3) an optimal-transport-based alignment module that matches Gaussian components across domains, enabling effective preference transfer for target-domain rating prediction. Experiments on Amazon Review datasets demonstrate that DUP-OT mitigates domain discrepancy and significantly outperforms state-of-the-art baselines under strictly non-overlapping training settings, with user correspondence revealed only for inference-time evaluation.


SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Lin, Jiawei, Li, Yuanlong, Chen, Guokai, Bourgeat, Thomas

arXiv.org Artificial Intelligence

Transformer models rely heavily on the scaled dot-product attention (SDPA) operation, typically implemented as FlashAttention. Characterized by its frequent interleaving of matrix multiplications and softmax operations, FlashAttention fails to fully utilize the compute resources of modern systolic-array-based accelerators designed for consecutive and large matrix multiplications. To fully unleash the performance potential of systolic arrays for FlashAttention, we propose FSA, an enhanced systolic array architecture that runs the entire FlashAttention on the array without external vector units. Combined with SystolicAttention, an optimized kernel for FSA that achieves fine-grained and element-wise overlapping of FlashAttention operations, FSA maximizes array utilization while preserving the original floating-point operation order of FlashAttention. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art systolic-array-based accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS Neuron-v2 and Google TPUv5e, respectively. We synthesize FSA in a 16 nm technology at 1.5 GHz, and results indicate only a 12% area overhead compared to a standard weight-stationary systolic array.